amaanq
@amaanq amaanq commented Oct 7, 2025

Problem

The DNS client's default timeout when looking up SRV records is 60 seconds. That is quite long, and it is problematic because it matches the default HTTP client timeout: the error that surfaces is the HTTP timeout rather than the underlying DNS error.

Solution

I've added a more aggressive timeout of 15 seconds in total (with retries after 1, 3, 3, 3, and 5 seconds) for the SRV lookup. I've also re-labeled the `defer.TimeoutError`, as was done in the prior Synapse PR, which makes the error clearer to users.
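
To make the arithmetic behind the 15-second figure concrete, here is a minimal sketch of the per-attempt timeouts and the elapsed time at which each attempt gives up. The constant name `LOOKUP_TIMEOUTS` follows the one proposed in the review thread; the helper `retry_schedule` is hypothetical and only for illustration, not part of the PR.

```python
from itertools import accumulate

# Per-attempt timeouts in seconds, as listed in the description above.
LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


def retry_schedule(timeouts):
    """Return the cumulative elapsed time (seconds) at which each attempt times out."""
    return list(accumulate(timeouts))


total = sum(LOOKUP_TIMEOUTS)
schedule = retry_schedule(LOOKUP_TIMEOUTS)

print(f"total budget: {total}s, attempts give up at: {schedule}")
```

Under these assumptions the last attempt gives up at the 15-second mark, well below the 60-second HTTP client timeout mentioned in the problem statement.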

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@amaanq amaanq requested a review from a team as a code owner October 7, 2025 19:34
@CLAassistant
CLAassistant commented Oct 7, 2025

CLA assistant check
All committers have signed the CLA.

@amaanq amaanq force-pushed the srv-timeout branch 3 times, most recently from 5059cd5 to aaacc87 Compare October 7, 2025 23:13
        raise e
except defer.TimeoutError as e:
    raise defer.TimeoutError(
        f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
Contributor

Suggested change
- f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
+ f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout=15s)"

Does this sound better?

My only real nit is that the message previously said "50s total", which no longer matches our new 15s timeout. Ideally, we'd have a constant to use here, but I'm not sure that moving `timeout=(1, 3, 3, 3, 5)` to the top as a constant is better. The comments probably read better next to where the value is used.

Contributor

I'd say make it a constant; then it's easy to do the following:

Suggested change
- f"Could not resolve DNS for SRV record {service_name!r} due to timeout (50s total)"
+ f"Timed out while trying to resolve DNS for SRV record for {service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"

Author

I like Jonathan's suggestion of using a constant, so I've gone ahead and applied it :)
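
For readers following along, the shape of the agreed change could be sketched roughly as below. This is an illustration under stated assumptions, not the merged code: `LookupTimeoutError` is a local stand-in for `defer.TimeoutError` so the snippet runs without Twisted, and `resolve_srv` and its `lookup` callable are hypothetical names.

```python
# Per-attempt timeouts in seconds; the name is taken from the suggestion above.
LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


class LookupTimeoutError(Exception):
    """Stand-in for twisted.internet.defer.TimeoutError (assumption for this sketch)."""


def resolve_srv(lookup, service_name):
    """Call `lookup` and re-raise timeouts with a message derived from the constant.

    `lookup` is a hypothetical callable taking (service_name, timeouts).
    """
    try:
        return lookup(service_name, LOOKUP_TIMEOUTS)
    except LookupTimeoutError as e:
        # Deriving the total from the constant keeps the message in sync
        # if the retry schedule is ever changed.
        raise LookupTimeoutError(
            f"Timed out while trying to resolve DNS for SRV record for "
            f"{service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
        ) from e
```

The point of the constant is that the number in the error message can no longer drift from the actual schedule, which is what happened with the stale "50s total" text.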

        return list(cache_entry)
    else:
        raise e
except defer.TimeoutError as e:
Contributor

It would be nice to have a test that exercises this part of the code, especially since I'm unsure whether `defer.TimeoutError` is actually the exception type raised here.

Do you think you would be up for that? Probably involves reactor.advance(15 + 1) to advance time past the timeout. Otherwise, I can take a stab at it.
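
As a rough illustration of the test pattern described here (advance a controllable clock just past the timeout and check that the timeout path fired), the following sketch uses a tiny hand-written clock. Everything in it is an assumption for illustration: `FakeClock` and `PendingLookup` are hypothetical stand-ins rather than Twisted's or Synapse's test reactor, the built-in `TimeoutError` stands in for `defer.TimeoutError`, and the service name is made up.

```python
import heapq

# Per-attempt timeouts in seconds, as quoted earlier in the thread.
LOOKUP_TIMEOUTS = (1, 3, 3, 3, 5)


class FakeClock:
    """Tiny deterministic clock; a stand-in for a test reactor, not a real API."""

    def __init__(self):
        self.now = 0.0
        self._calls = []  # heap of (due_time, sequence_number, callback)
        self._seq = 0

    def call_later(self, delay, callback):
        heapq.heappush(self._calls, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, amount):
        # Move time forward and run every callback that has become due.
        self.now += amount
        while self._calls and self._calls[0][0] <= self.now:
            _, _, callback = heapq.heappop(self._calls)
            callback()


class PendingLookup:
    """Hypothetical lookup that never gets an answer and fails once the budget elapses."""

    def __init__(self, clock, service_name):
        self.error = None
        self._service_name = service_name
        clock.call_later(sum(LOOKUP_TIMEOUTS), self._on_timeout)

    def _on_timeout(self):
        self.error = TimeoutError(
            f"Timed out while trying to resolve DNS for SRV record for "
            f"{self._service_name!r} (timeout={sum(LOOKUP_TIMEOUTS)}s)"
        )


clock = FakeClock()
lookup = PendingLookup(clock, "_matrix._tcp.example.com")

clock.advance(14)            # still inside the 15s budget
error_before = lookup.error
clock.advance(2)             # 16s elapsed in total, i.e. 15 + 1
error_after = lookup.error
```

A real test would drive the actual resolver with the project's test reactor and assert on the exception type it raises, which is exactly the open question in this comment.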

Contributor

@ShadowJonathan ShadowJonathan left a comment

15 seconds may be short, but I think that if a server isn't responding in between those multi-second retries, it's having other issues anyway.

LGTM

Development

Successfully merging this pull request may close these issues.

synapse.http.federation.srv_resolver.SrvResolver.resolve_service isn't able to "timeout" properly, and thus stalls federation

4 participants